
Materials data science: descriptors and machine learning

Welcome to the materials data science lesson. In this session, we will demonstrate how to use matminer, automatminer, pandas, and scikit-learn for machine learning of materials properties.

The lesson is split into four sections:

1. Data retrieval and basic analysis of pandas DataFrame objects.
2. Generating machine-learnable descriptors.
3. Training, testing, and visualizing machine learning models with scikit-learn and FigRecipes.
4. Automating steps 2 and 3 using automatminer.

Many more tutorials on how to use matminer (beyond the scope of this workshop) are available in the matminer_examples repository.

Machine learning workflow

Firstly, what does a typical machine learning workflow look like? The overall process can be summarized as:

1. Take raw inputs, such as a list of compositions, and an associated target property to learn.
2. Convert the raw inputs into descriptors or features that can be learned by machine learning algorithms.
3. Train a machine learning model on the data.
4. Plot and analyze the performance of the model.
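The four steps above can be sketched in a few lines of scikit-learn. This is a minimal, purely illustrative example: the data is synthetic (random features standing in for descriptors) and the model choice is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 1. Raw inputs: pretend each row is a material and y is its target property
rng = np.random.default_rng(0)
X = rng.random((100, 4))  # 2. stand-in for featurized descriptors
y = X @ np.array([1.5, 7.8, 9.1, 0.09]) + rng.normal(0, 0.1, 100)

# 3. Train a model on a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# 4. Analyze performance on held-out data
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {mae:.2f}")
```

In the rest of the lesson, the synthetic `X` is replaced by real descriptors generated with matminer.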

machine learning workflow

Typically, questions asked by a new practitioner in the field include:

- Where do we get the raw data from?
- How do we convert the raw data into learnable features?
- How can we plot and interpret the results of a model?

The matminer package has been developed to help make machine learning of materials properties easy and hassle free. The aim of matminer is to connect materials data with data mining algorithms and data visualization.

matminer overview

Part 1: Data retrieval and filtering

Matminer interfaces with many materials databases, including:

- Materials Project
- Citrine
- AFLOW
- Materials Data Facility (MDF)
- Materials Platform for Data Science (MPDS)

In addition, it also includes datasets from published literature. Matminer hosts a repository of 26 (and growing) datasets drawn from peer-reviewed machine learning investigations of materials properties and from high-throughput computing studies.

In this section, we will show how to access and manipulate the datasets from the published literature. More information on accessing other materials databases is detailed in the matminer_examples repository.

A list of the literature-based datasets can be printed using the get_available_datasets() function. This also prints information about what the dataset contains, such as the number of samples, the target properties, and how the data was obtained (e.g., via theory or experiment).

from matminer.datasets import get_available_datasets

get_available_datasets()
boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The  Materials Project database that are calculated by the BoltzTraP software package run on the GGA-PBE or GGA+U density functional theory calculation results. The properties are reported at the temperature of 300 Kelvin and the carrier concentration of 1e18 1/cm3.

brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.

castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge positions and heat of formation.

citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured experimentally and retrieved from Citrine database from various references. The reported values are measured at various temperatures of which 295 are at room temperature.

dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.

double_perovskites_gap: Band gap of 1306 double perovskites (a_1-b_1-a_2-b_2-O6) calculated using Gritsenko, van Leeuwen, van Lenthe and Baerends potential (gllbsc) in GPAW.

double_perovskites_gap_lumo: Supplementary lumo data of 55 atoms for the double_perovskites_gap dataset.

elastic_tensor_2015: 1,181 structures with elastic properties calculated with DFT-PBE.

expt_formation_enthalpy: Experimental formation enthalpies for inorganic compounds, collected from years of calorimetric experiments. There are 1,276 entries in this dataset, mostly binary compounds. Matching mpids or oqmdids as well as the DFT-computed formation energies are also added (if any).

expt_gap: Experimental band gap of 6354 inorganic semiconductors.

flla: 3938 structures and computed formation energies from "Crystal Structure Representations for Machine Learning Models of Formation Energies."

glass_binary: Metallic glass formation data for binary alloys, collected from various experimental techniques such as melt-spinning or mechanical alloying. This dataset covers all compositions with an interval of 5 at. % in 59 binary systems, containing a total of 5959 alloys in the dataset. The target property of this dataset is the glass forming ability (GFA), i.e. whether the composition can form monolithic glass or not, which is either 1 for glass forming or 0 for non-full glass forming.

glass_binary_v2: Identical to glass_binary dataset, but with duplicate entries merged. If there was a disagreement in gfa when merging the class was defaulted to 1.

glass_ternary_hipt: Metallic glass formation dataset for ternary alloys, collected from the high-throughput sputtering experiments measuring whether it is possible to form a glass using sputtering. The hipt experimental data are of the Co-Fe-Zr, Co-Ti-Zr, Co-V-Zr and Fe-Ti-Nb ternary systems.

glass_ternary_landolt: Metallic glass formation dataset for ternary alloys, collected from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys," a volume of the Landolt–Börnstein collection. This dataset contains experimental measurements of whether it is possible to form a glass using a variety of processing techniques at thousands of compositions from hundreds of ternary systems. The processing techniques are designated in the "processing" column. There are originally 7191 experiments in this dataset, reduced to 6203 after deduplication, and further reduced to 6118 if multiple data for one composition are combined. Of these, there are originally 6780 melt-spinning experiments, reduced to 5800 after deduplication, and further reduced to 5736 if multiple experimental data for one composition are combined.

heusler_magnetic: 1153 Heusler alloys with DFT-calculated magnetic and electronic properties. The 1153 alloys include 576 full, 449 half and 128 inverse Heusler alloys. The data are extracted and cleaned (including de-duplicating) from Citrine.

jarvis_dft_2d: Various properties of 636 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

jarvis_dft_3d: Various properties of 25,923 bulk materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

jarvis_ml_dft_training: Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.

m2ax: Elastic properties of 223 stable M2AX compounds from "A comprehensive survey of M2AX phase elastic properties" by Cover et al. Calculations are PAW PW91.

matbench_dielectric: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019.

matbench_expt_gap: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports.

matbench_expt_is_metal: Matbench v0.1 test dataset for classifying metallicity from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, ensuring no conflicting reports were entered for any compositions (i.e., no reported compositions were both metal and nonmetal).

matbench_glass: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys," a volume of the Landolt–Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation).

matbench_jdft2d: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database.

matbench_log_gvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019.

matbench_log_kvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019.

matbench_mp_e_form: Matbench v0.1 test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 3.0eV and those containing noble gases. Retrieved April 2, 2019.

matbench_mp_gap: Matbench v0.1 test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019.

matbench_mp_is_metal: Matbench v0.1 test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019.

matbench_perovskites: Matbench v0.1 test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al.

matbench_phonons: Matbench v0.1 test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV.

matbench_steels: Matbench v0.1 dataset for predicting steel yield strengths from chemical composition alone. Retrieved from Citrine informatics. Deduplicated.

mp_all_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.

mp_nostruct_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.

phonon_dielectric_mp: Phonon (lattice/atoms vibrations) and dielectric properties of 1296 compounds computed via ABINIT software package in the harmonic approximation based on density functional perturbation theory.

piezoelectric_tensor: 941 structures with piezoelectric properties, calculated with DFT-PBE.

steel_strength: 312 steels with experimental yield strength and ultimate tensile strength, extracted and cleaned (including de-duplicating) from Citrine.

wolverton_oxides: 4,914 perovskite oxides containing composition data, lattice constants, and formation + vacancy formation energies. All perovskites are of the form ABO3. Adapted from a dataset presented by Emery and Wolverton.



['boltztrap_mp',
 'brgoch_superhard_training',
 'castelli_perovskites',
 'citrine_thermal_conductivity',
 'dielectric_constant',
 'double_perovskites_gap',
 'double_perovskites_gap_lumo',
 'elastic_tensor_2015',
 'expt_formation_enthalpy',
 'expt_gap',
 'flla',
 'glass_binary',
 'glass_binary_v2',
 'glass_ternary_hipt',
 'glass_ternary_landolt',
 'heusler_magnetic',
 'jarvis_dft_2d',
 'jarvis_dft_3d',
 'jarvis_ml_dft_training',
 'm2ax',
 'matbench_dielectric',
 'matbench_expt_gap',
 'matbench_expt_is_metal',
 'matbench_glass',
 'matbench_jdft2d',
 'matbench_log_gvrh',
 'matbench_log_kvrh',
 'matbench_mp_e_form',
 'matbench_mp_gap',
 'matbench_mp_is_metal',
 'matbench_perovskites',
 'matbench_phonons',
 'matbench_steels',
 'mp_all_20181018',
 'mp_nostruct_20181018',
 'phonon_dielectric_mp',
 'piezoelectric_tensor',
 'steel_strength',
 'wolverton_oxides']

Datasets can be loaded using the load_dataset() function and the dataset name. To save installation space, the datasets are not automatically downloaded when matminer is installed. Instead, the first time a dataset is loaded, it will be downloaded from the internet and stored in the matminer installation directory.

Let's load the dielectric_constant dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.

from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")

Manipulating and examining pandas DataFrame objects

The datasets are made available as pandas DataFrame objects. You can think of these as a type of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.

Inspecting the dataset

The head() function prints a summary of the first few rows of a data set. You can scroll across to see more columns. From this, it is easy to see the types of data available in the dataset.

df.head()
material_id formula nsites space_group volume structure band_gap e_electronic e_total n poly_electronic poly_total pot_ferroelectric cif meta poscar
0 mp-441 Rb2Te 3 225 159.501208 [[1.75725875 1.2425695 3.04366125] Rb, [5.271... 1.88 [[3.44115795, -3.097e-05, -6.276e-05], [-2.837... [[6.23414745, -0.00035252, -9.796e-05], [-0.00... 1.86 3.44 6.23 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1 mp-22881 CdCl2 3 166 84.298097 [[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13... 3.52 [[3.34688382, -0.04498543, -0.22379197], [-0.0... [[7.97018673, -0.29423886, -1.463590159999999]... 1.78 3.16 6.73 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2 mp-28013 MnI2 3 164 108.335875 [[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+... 1.17 [[5.5430849, -5.28e-06, -2.5030000000000003e-0... [[13.80606079, 0.0006911900000000001, 9.655e-0... 2.23 4.97 10.64 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3 mp-567290 LaN 4 186 88.162562 [[-1.73309900e-06 2.38611186e+00 5.95256328e... 1.12 [[7.09316738, 7.99e-06, -0.0003864700000000000... [[16.79535386, 8.199999999999997e-07, -0.00948... 2.65 7.04 17.99 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4 mp-560902 MnF2 6 136 82.826401 [[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M... 2.87 [[2.4239622, 7.452000000000001e-05, 6.06100000... [[6.44055613, 0.0020446600000000002, 0.0013203... 1.53 2.35 7.12 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...

Sometimes, if a dataset is very large, you will be unable to see all the available columns. Instead, you can see the full list of columns using the columns attribute:

df.columns
Index(['material_id', 'formula', 'nsites', 'space_group', 'volume',
       'structure', 'band_gap', 'e_electronic', 'e_total', 'n',
       'poly_electronic', 'poly_total', 'pot_ferroelectric', 'cif', 'meta',
       'poscar'],
      dtype='object')

A pandas DataFrame includes a function called describe() that computes summary statistics for the columns in the data. Note that the describe() function only includes numerical columns by default.

Sometimes, the describe() function will reveal outliers that indicate mistakes in the data.

df.describe()
nsites space_group volume band_gap n poly_electronic poly_total
count 1056.000000 1056.000000 1056.000000 1056.000000 1056.000000 1056.000000 1056.000000
mean 7.530303 142.970644 166.420376 2.119432 2.434886 7.248049 14.777898
std 3.388443 67.264591 97.425084 1.604924 1.148849 13.054947 19.435303
min 2.000000 1.000000 13.980548 0.110000 1.280000 1.630000 2.080000
25% 5.000000 82.000000 96.262337 0.890000 1.770000 3.130000 7.557500
50% 8.000000 163.000000 145.944691 1.730000 2.190000 4.790000 10.540000
75% 9.000000 194.000000 212.106405 2.885000 2.730000 7.440000 15.482500
max 20.000000 229.000000 597.341134 8.320000 16.030000 256.840000 277.780000
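Because describe() skips non-numerical columns by default, categorical columns such as formula do not appear above. A small illustration on a toy DataFrame (hypothetical data; the include parameter controls which column types are summarized):

```python
import pandas as pd

# Toy DataFrame with one categorical and one numerical column
toy = pd.DataFrame({"formula": ["Rb2Te", "CdCl2"], "band_gap": [1.88, 3.52]})

print(toy.describe())                  # numerical columns only: band_gap
print(toy.describe(include="object"))  # categorical columns only: formula
```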

Indexing the dataset

We can access a particular column of a DataFrame by indexing the object using the column name. For example:

df["band_gap"]
0       1.88
1       3.52
2       1.17
3       1.12
4       2.87
        ... 
1051    0.87
1052    3.60
1053    0.14
1054    0.21
1055    0.26
Name: band_gap, Length: 1056, dtype: float64

Alternatively, we can access a particular row of a DataFrame using the iloc indexer.

df.iloc[100]
material_id                                                    mp-7140
formula                                                            SiC
nsites                                                               4
space_group                                                        186
volume                                                         42.0055
structure            [[-1.87933700e-06  1.78517223e+00  2.53458835e...
band_gap                                                           2.3
e_electronic         [[6.9589498, -3.29e-06, 0.0014472600000000001]...
e_total              [[10.193825310000001, -3.7090000000000006e-05,...
n                                                                 2.66
poly_electronic                                                   7.08
poly_total                                                       10.58
pot_ferroelectric                                                False
cif                  #\#CIF1.1\n###################################...
meta                 {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...
poscar               Si2 C2\n1.0\n3.092007 0.000000 0.000000\n-1.54...
Name: 100, dtype: object

Filtering the dataset

Pandas DataFrame objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (==, >, >=, <, etc.) to filter numerical values. For example, let's find all entries where the cell volume is 580 or greater. We do this by filtering on the volume column.

Note that we first produce a boolean mask – a series of True and False depending on the comparison. We can then use the mask to filter the DataFrame.

mask = df["volume"] >= 580
df[mask]
material_id formula nsites space_group volume structure band_gap e_electronic e_total n poly_electronic poly_total pot_ferroelectric cif meta poscar
206 mp-23280 AsCl3 16 19 582.085309 [[0.13113333 7.14863883 9.63476955] As, [2.457... 3.99 [[2.2839161900000002, 0.00014519, -2.238000000... [[2.49739759, 0.00069379, 0.00075864], [0.0004... 1.57 2.47 3.30 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... As4 Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216 mp-9064 RbTe 12 189 590.136085 [[6.61780282 0. 0. ] Rb, [1.750... 0.43 [[3.25648277, 5.9650000000000007e-05, 1.57e-06... [[5.34517928, 0.00022474000000000002, -0.00018... 2.05 4.20 6.77 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb6 Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219 mp-23230 PCl3 16 62 590.637274 [[6.02561815 8.74038483 7.55586375] P, [2.7640... 4.03 [[2.39067769, 0.00017593, 8.931000000000001e-0... [[2.80467218, 0.00034093000000000003, 0.000692... 1.52 2.31 2.76 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... P4 Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251 mp-2160 Sb2Se3 20 62 597.341134 [[3.02245275 0.42059268 1.7670481 ] Sb, [ 1.00... 0.76 [[19.1521058, 5.5e-06, 0.00025268], [-1.078000... [[81.93819038000001, 0.0006755800000000001, 0.... 3.97 15.76 63.53 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Sb8 Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...

We can use this method of filtering to clean our dataset. For example, if we only wanted our dataset to include semiconductors (materials with a non-zero band gap), we can do this easily by filtering the band_gap column.

mask = df["band_gap"] > 0
semiconductor_df = df[mask]
semiconductor_df
material_id formula nsites space_group volume structure band_gap e_electronic e_total n poly_electronic poly_total pot_ferroelectric cif meta poscar
0 mp-441 Rb2Te 3 225 159.501208 [[1.75725875 1.2425695 3.04366125] Rb, [5.271... 1.88 [[3.44115795, -3.097e-05, -6.276e-05], [-2.837... [[6.23414745, -0.00035252, -9.796e-05], [-0.00... 1.86 3.44 6.23 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1 mp-22881 CdCl2 3 166 84.298097 [[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13... 3.52 [[3.34688382, -0.04498543, -0.22379197], [-0.0... [[7.97018673, -0.29423886, -1.463590159999999]... 1.78 3.16 6.73 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2 mp-28013 MnI2 3 164 108.335875 [[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+... 1.17 [[5.5430849, -5.28e-06, -2.5030000000000003e-0... [[13.80606079, 0.0006911900000000001, 9.655e-0... 2.23 4.97 10.64 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3 mp-567290 LaN 4 186 88.162562 [[-1.73309900e-06 2.38611186e+00 5.95256328e... 1.12 [[7.09316738, 7.99e-06, -0.0003864700000000000... [[16.79535386, 8.199999999999997e-07, -0.00948... 2.65 7.04 17.99 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4 mp-560902 MnF2 6 136 82.826401 [[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M... 2.87 [[2.4239622, 7.452000000000001e-05, 6.06100000... [[6.44055613, 0.0020446600000000002, 0.0013203... 1.53 2.35 7.12 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1051 mp-568032 Cd(InSe2)2 7 111 212.493121 [[0. 0. 0.] Cd, [2.9560375 0. 3.03973 ... 0.87 [[7.74896783, 0.0, 0.0], [0.0, 7.74896783, 0.0... [[11.85159471, 1e-08, 0.0], [1e-08, 11.8515962... 2.77 7.67 11.76 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Cd1 In2 Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052 mp-696944 LaHBr2 8 194 220.041363 [[2.068917 3.58317965 3.70992025] La, [4.400... 3.60 [[4.40504391, 6.1e-07, 0.0], [6.1e-07, 4.40501... [[8.77136355, 1.649999999999999e-06, 0.0], [1.... 2.00 3.99 7.08 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... La2 H2 Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053 mp-16238 Li2AgSb 4 216 73.882306 [[1.35965225 0.96141925 2.354987 ] Li, [2.719... 0.14 [[212.60750153, -1.843e-05, 0.0], [-1.843e-05,... [[232.59707383, -0.0005407400000000001, 0.0025... 14.58 212.61 232.60 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Li2 Ag1 Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054 mp-4405 Rb3AuO 5 221 177.269065 [[0. 2.808758 2.808758] Rb, [2.808758 2.... 0.21 [[6.40511712, 0.0, 0.0], [0.0, 6.40511712, 0.0... [[22.43799785, 0.0, 0.0], [0.0, 22.4380185, 0.... 2.53 6.41 22.44 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb3 Au1 O1\n1.0\n5.617516 0.000000 0.000000\n0...
1055 mp-3486 KSnSb 6 186 227.725015 [[-1.89006800e-06 2.56736395e+00 1.32914373e... 0.26 [[13.85634957, 1.8e-06, 0.0], [1.8e-06, 13.856... [[16.45311887, 1.64e-06, -0.00019139], [1.64e-... 3.53 12.47 15.55 True #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... K2 Sn2 Sb2\n1.0\n4.446803 0.000000 0.000000\n-...

1056 rows × 16 columns
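Multiple masks can be combined with the element-wise operators & (and) and | (or), with parentheses around each comparison. A short sketch on a toy DataFrame (hypothetical values standing in for the dielectric data):

```python
import pandas as pd

# Toy stand-in for the dielectric DataFrame (hypothetical values)
df = pd.DataFrame({"band_gap": [0.0, 1.2, 3.5],
                   "volume": [90.0, 150.0, 600.0]})

# Parentheses are required around each comparison before combining with &
mask = (df["band_gap"] > 0) & (df["volume"] < 500)
filtered = df[mask]  # keeps only the row with band_gap 1.2
print(filtered)
```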

Often, a dataset contains many additional columns that are not necessary for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove whole columns from the dataset using the drop() function. This function can be used to drop both rows and columns.

The function takes a list of items to drop. For columns, these are the column names, whereas for rows they are the index labels. Finally, the axis option specifies whether the data to drop are columns (1) or rows (0).

For example, to remove the nsites, space_group, e_electronic, and e_total columns, we can run:

cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"],
                     axis=1)

Let's examine the cleaned DataFrame to see that the columns have been removed.

cleaned_df.head()
material_id formula volume structure band_gap n poly_electronic poly_total pot_ferroelectric cif meta poscar
0 mp-441 Rb2Te 159.501208 [[1.75725875 1.2425695 3.04366125] Rb, [5.271... 1.88 1.86 3.44 6.23 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1 mp-22881 CdCl2 84.298097 [[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13... 3.52 1.78 3.16 6.73 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2 mp-28013 MnI2 108.335875 [[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+... 1.17 2.23 4.97 10.64 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3 mp-567290 LaN 88.162562 [[-1.73309900e-06 2.38611186e+00 5.95256328e... 1.12 2.65 7.04 17.99 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4 mp-560902 MnF2 82.826401 [[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M... 2.87 1.53 2.35 7.12 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
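The same drop() function removes rows when axis=0 (the default), taking index labels instead of column names. A small sketch on a toy DataFrame (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({"formula": ["Rb2Te", "CdCl2", "MnI2"],
                    "band_gap": [1.88, 3.52, 1.17]})

# Drop the rows with index labels 0 and 2; axis=0 is the default
trimmed = toy.drop([0, 2], axis=0)
print(trimmed)  # only the CdCl2 row remains
```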

Generating new columns

Pandas DataFrame objects also make it easy to perform simple calculations on the data. Think of this as using formulas in Excel spreadsheets. All fundamental Python math operators (such as +, -, /, and *) can be used.

For example, the dielectric dataset contains the electronic contribution to the dielectric constant (\(\epsilon_\mathrm{electronic}\), in the poly_electronic column) and the total (static) dielectric constant (\(\epsilon_\mathrm{total}\), in the poly_total column). The ionic contribution to the dielectric constant is given by:

\[ \epsilon_\mathrm{ionic} = \epsilon_\mathrm{total} - \epsilon_\mathrm{electronic} \]

Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if the column doesn't already exist.

df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]

Let's check the new data was added correctly.

df.head()
material_id formula nsites space_group volume structure band_gap e_electronic e_total n poly_electronic poly_total pot_ferroelectric cif meta poscar poly_ionic
0 mp-441 Rb2Te 3 225 159.501208 [[1.75725875 1.2425695 3.04366125] Rb, [5.271... 1.88 [[3.44115795, -3.097e-05, -6.276e-05], [-2.837... [[6.23414745, -0.00035252, -9.796e-05], [-0.00... 1.86 3.44 6.23 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75... 2.79
1 mp-22881 CdCl2 3 166 84.298097 [[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13... 3.52 [[3.34688382, -0.04498543, -0.22379197], [-0.0... [[7.97018673, -0.29423886, -1.463590159999999]... 1.78 3.16 6.73 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78... 3.57
2 mp-28013 MnI2 3 164 108.335875 [[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+... 1.17 [[5.5430849, -5.28e-06, -2.5030000000000003e-0... [[13.80606079, 0.0006911900000000001, 9.655e-0... 2.23 4.97 10.64 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07... 5.67
3 mp-567290 LaN 4 186 88.162562 [[-1.73309900e-06 2.38611186e+00 5.95256328e... 1.12 [[7.09316738, 7.99e-06, -0.0003864700000000000... [[16.79535386, 8.199999999999997e-07, -0.00948... 2.65 7.04 17.99 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06... 10.95
4 mp-560902 MnF2 6 136 82.826401 [[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M... 2.87 [[2.4239622, 7.452000000000001e-05, 6.06100000... [[6.44055613, 0.0020446600000000002, 0.0013203... 1.53 2.35 7.12 False #\#CIF1.1\n###################################... {u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000... 4.77

Part 2: Generating descriptors for machine learning

In this section, we will learn a bit about how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizers" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.

featurizers overview

Featurizers transform materials primitives into machine-learnable features

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a vector. For example:

\[\begin{align} f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09] \end{align}\]

Matminer contains featurizers for the following pymatgen objects:

  • Composition
  • Crystal structure
  • Crystal sites
  • Bandstructure
  • Density of states

Depending on the featurizer, the features returned may be:

  • numerical, categorical, or mixed vectors
  • matrices
  • other pymatgen objects (for further processing)

Featurizers play nice with dataframes

Since most of the time we are working with pandas dataframes, all featurizers work natively with pandas dataframes. We'll provide examples of this later in the lesson.

Featurizers present in matminer

Matminer hosts over 60 featurizers, most of which are implemented from methods published in peer reviewed papers. You can find a full list of featurizers on the matminer website. All featurizers have parallelization and convenient error tolerance built into their core methods.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.

The featurize method and basics

The core method of any matminer featurizer is featurize(). This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

from pymatgen import Composition

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the ElementFraction featurizer.

from matminer.featurizers.composition import ElementFraction

ef = ElementFraction()

Now we can featurize our composition.

element_fractions = ef.featurize(fe2o3)

print(element_fractions)
[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

We've managed to generate features for learning, but what do they mean? One way to check is by reading the Features section in the documentation of any featurizer... but a much easier way is to use the feature_labels() method.

element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']

We now see the labels in the order that we generated the features.

print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
O 0.6
Fe 0.4

Featurizing dataframes

We just generated some descriptors and their labels from an individual sample, but most of the time our data is in pandas dataframes. Fortunately, matminer featurizers implement a featurize_dataframe() method which interacts natively with dataframes.

Let's grab a new dataset from matminer and use our ElementFraction featurizer on it.

First, we download a dataset as we did in the previous unit. In this example, we'll download a dataset of superhard materials.

from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("brgoch_superhard_training")
df.head()
formula bulk_modulus shear_modulus composition material_id structure brgoch_feats suspect_value
0 AlPt3 225.230461 91.197748 (Al, Pt) mp-188 [[0. 0. 0.] Al, [0. 1.96140395 1.96140... {'atomic_number_feat_1': 123.5, 'atomic_number... False
1 Mn2Nb 232.696340 74.590157 (Mn, Nb) mp-12659 [[-2.23765223e-08 1.42974191e+00 5.92614104e... {'atomic_number_feat_1': 45.5, 'atomic_number_... False
2 HfO2 204.573433 98.564374 (Hf, O) mp-352 [[2.24450185 3.85793022 4.83390736] O, [2.7788... {'atomic_number_feat_1': 44.0, 'atomic_number_... False
3 Cu3Pt 159.312640 51.778816 (Cu, Pt) mp-12086 [[0. 1.86144248 1.86144248] Cu, [1.861... {'atomic_number_feat_1': 82.5, 'atomic_number_... False
4 Mg3Pt 69.637565 27.588765 (Mg, Pt) mp-18707 [[0. 0. 2.73626461] Mg, [0. ... {'atomic_number_feat_1': 57.0, 'atomic_number_... False

Next, we can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all of our data at once. The only required arguments are the dataframe as input and the input column name (in this case it is composition). featurize_dataframe() is parallelized by default using multiprocessing.

df = ef.featurize_dataframe(df, "composition")


If we look at the dataframe, we can see our new feature columns.

df.head()
formula bulk_modulus shear_modulus composition material_id structure brgoch_feats suspect_value H He ... Pu Am Cm Bk Cf Es Fm Md No Lr
0 AlPt3 225.230461 91.197748 (Al, Pt) mp-188 [[0. 0. 0.] Al, [0. 1.96140395 1.96140... {'atomic_number_feat_1': 123.5, 'atomic_number... False 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Mn2Nb 232.696340 74.590157 (Mn, Nb) mp-12659 [[-2.23765223e-08 1.42974191e+00 5.92614104e... {'atomic_number_feat_1': 45.5, 'atomic_number_... False 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 HfO2 204.573433 98.564374 (Hf, O) mp-352 [[2.24450185 3.85793022 4.83390736] O, [2.7788... {'atomic_number_feat_1': 44.0, 'atomic_number_... False 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Cu3Pt 159.312640 51.778816 (Cu, Pt) mp-12086 [[0. 1.86144248 1.86144248] Cu, [1.861... {'atomic_number_feat_1': 82.5, 'atomic_number_... False 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 Mg3Pt 69.637565 27.588765 (Mg, Pt) mp-18707 [[0. 0. 2.73626461] Mg, [0. ... {'atomic_number_feat_1': 57.0, 'atomic_number_... False 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 111 columns

Structure Featurizers

We can use the same syntax for other kinds of featurizers. Let's now assign descriptors to a structure. We do this with the same syntax as the composition featurizers. First, let's load a dataset containing structures.

df = load_dataset("phonon_dielectric_mp")

df.head()
mpid eps_electronic eps_total last phdos peak structure formula
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe

Let's calculate some basic density features of these structures using DensityFeatures.

from matminer.featurizers.structure import DensityFeatures

densityf = DensityFeatures()
densityf.feature_labels()
['density', 'vpa', 'packing fraction']

These are the features we will get. Now we use featurize_dataframe() to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.

df = densityf.featurize_dataframe(df, "structure")


Let's examine the dataframe and see the structural features.

Conversion Featurizers

In addition to Bandstructure/DOS/Structure/Composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects (e.g., assigning oxidation states to compositions) in a fault-tolerant fashion. These featurizers are found in matminer.featurizers.conversions and work with the same featurize()/featurize_dataframe() syntax as the other featurizers.

The dataset we loaded previously stores chemical formulas as plain strings in the formula column. To convert this data into a composition column containing pymatgen Composition objects, we can use the StrToComposition conversion featurizer on the formula column.

from matminer.featurizers.conversions import StrToComposition

stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula")


We can see a new composition column has been added to the dataframe.

df.head()
mpid eps_electronic eps_total last phdos peak structure formula density vpa packing fraction composition
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe 4.937886 44.545547 0.596286 (Ba, Te)
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC 9.868234 16.027886 0.531426 (Hf, C)
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC 5.760895 12.199996 0.394180 (Ge, C)
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs 5.087634 13.991016 0.319600 (B, As)
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe 2.750191 35.937000 0.428523 (Ca, Se)

Advanced capabilities

There are a few powerful featurizer capabilities worth mentioning quickly before we practice (and many more not covered here).

Dealing with Errors

Often, data is messy and certain featurizers will encounter errors. Set ignore_errors=True in featurize_dataframe() to skip errors; if you'd like to see the errors returned in an additional column, also set return_errors=True.

Citing the authors

Many featurizers are implemented using methods found in peer-reviewed studies. Please cite these original works using the citations() method, which returns the BibTeX-formatted references as a Python list. For example:

Part 3: Machine learning models

In parts 1 and 2, we demonstrated how to download a dataset and add machine learnable features. In part 3, we show how to train a machine learning model on a dataset and analyze the results.

Scikit-Learn

This unit makes extensive use of the scikit-learn package, an open-source python package for machine learning. Matminer has been designed to make machine learning with scikit-learn as easy as possible. Other machine learning packages exist, such as TensorFlow, which implement neural network architectures. These packages can also be used with matminer but are outside the scope of this workshop.

Load and prepare a pre-featurized model

First, let's load a dataset that we can use for machine learning. In advance, we've added some composition and structure features to the elastic_tensor_2015 dataset used in exercises 1 and 2.

import os
from matminer.utils.io import load_dataframe_from_json

df = load_dataframe_from_json(os.path.join("resources", "elastic_tensor_2015_featurized.json"))
df.head()
structure formula K_VRH composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number ... MagpieData mode GSmagmom MagpieData minimum SpaceGroupNumber MagpieData maximum SpaceGroupNumber MagpieData range SpaceGroupNumber MagpieData mean SpaceGroupNumber MagpieData avg_dev SpaceGroupNumber MagpieData mode SpaceGroupNumber density vpa packing fraction
0 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... Nb4CoSi 194.268884 (Nb, Co, Si) 14.0 41.0 27.0 34.166667 9.111111 41.0 ... 0.0 194.0 229.0 35.0 222.833333 9.611111 229.0 7.834556 16.201654 0.688834
1 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... Al(CoSi)2 175.449907 (Al, Co, Si) 13.0 27.0 14.0 19.000000 6.400000 14.0 ... 0.0 194.0 227.0 33.0 213.400000 15.520000 194.0 5.384968 12.397466 0.644386
2 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] SiOs 295.077545 (Si, Os) 14.0 76.0 62.0 45.000000 31.000000 14.0 ... 0.0 194.0 227.0 33.0 210.500000 16.500000 194.0 13.968635 12.976265 0.569426
3 [[0. 1.09045794 0.84078375] Ga, [0. ... Ga 49.130670 (Ga) 31.0 31.0 0.0 31.000000 0.000000 31.0 ... 0.0 64.0 64.0 0.0 64.000000 0.000000 64.0 6.036267 19.180359 0.479802
4 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... SiRu2 256.768081 (Si, Ru) 14.0 44.0 30.0 34.000000 13.333333 44.0 ... 0.0 194.0 227.0 33.0 205.000000 14.666667 194.0 9.539514 13.358418 0.598395

5 rows × 139 columns

We first need to split the dataset into the "target" property and the "features" used for learning. In this model, we will use the bulk modulus (K_VRH) as the target property. We use the values attribute of the dataframe to extract the target property as a numpy array, rather than a pandas Series object.

y = df['K_VRH'].values

print(y)
[194.26888436 175.44990675 295.07754499 ...  89.41816126  99.3845653
  35.93865993]

The machine learning algorithm can only use numerical features for training. Accordingly, we need to remove any non-numerical columns from our dataset. Additionally, we want to remove the K_VRH column from the set of features, as the model should not know about the target property in advance.

The dataset loaded above includes structure, formula, and composition columns that were previously used to generate the machine-learnable features. Let's remove them using the pandas drop() function, discussed in unit 1. Remember, axis=1 indicates we are dropping columns rather than rows.

X = df.drop(["structure", "formula", "composition", "K_VRH"], axis=1)

We can see all the descriptors in the model using the columns attribute.

print("There are {} possible descriptors:".format(len(X.columns)))
print(X.columns)
There are 135 possible descriptors:
Index(['MagpieData minimum Number', 'MagpieData maximum Number',
       'MagpieData range Number', 'MagpieData mean Number',
       'MagpieData avg_dev Number', 'MagpieData mode Number',
       'MagpieData minimum MendeleevNumber',
       'MagpieData maximum MendeleevNumber',
       'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber',
       ...
       'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber',
       'MagpieData maximum SpaceGroupNumber',
       'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber',
       'MagpieData avg_dev SpaceGroupNumber',
       'MagpieData mode SpaceGroupNumber', 'density', 'vpa',
       'packing fraction'],
      dtype='object', length=135)

Try a random forest model using scikit-learn

The scikit-learn library makes it easy to use our generated features for training machine learning models. It implements a variety of different regression models and contains tools for cross-validation.

In the interest of time, in this example we will only trial a single model, but it is good practice to trial multiple models to see which performs best for your machine learning problem. A good "starting" model is the random forest. Let's create a random forest model.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=1)

Notice we created the model with the number of estimators (n_estimators) set to 100. n_estimators is an example of a machine learning hyper-parameter. Most models contain many tunable hyper-parameters. To obtain good performance, it is necessary to fine-tune these parameters for each individual machine learning problem. There is currently no simple way to know in advance which hyper-parameters will be optimal; usually, a trial-and-error approach is used.
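The trial-and-error approach can be partly automated with an exhaustive grid search over candidate hyper-parameter values, scored by cross-validation. Below is a minimal sketch using scikit-learn's GridSearchCV on synthetic data (the grid values are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for featurized materials.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, random_state=1)

# Try a few values of two common random forest hyper-parameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

grid = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(X_demo, y_demo)

print(grid.best_params_)  # the best combination found on this data
```

After fitting, best_params_ holds the winning combination and best_estimator_ holds a model refit with it on the full data.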

We can now train our model to use the input features (X) to predict the target property (y). This is achieved using the fit() function.

rf.fit(X, y)
RandomForestRegressor(random_state=1)

That's it, we have trained our first machine learning model!

Evaluating model performance

Next, we need to assess how the model is performing. To do this, we first ask the model to predict the bulk modulus for every entry in our original dataframe.

y_pred = rf.predict(X)

Next, we can check the accuracy of our model by looking at the root mean squared error of our predictions. Scikit-learn provides a mean_squared_error() function to calculate the mean squared error. We then take the square-root of this to obtain our final performance metric.

import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
training RMSE = 7.272 GPa

An RMSE of 7.2 GPa looks very reasonable! However, as the model was trained and evaluated on exactly the same data, this is not a true estimate of how the model will perform for unseen materials (the primary purpose of machine learning studies).
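The gap between training error and error on unseen data is easy to demonstrate with a simple holdout split on synthetic data (an illustrative sketch, separate from the lesson's dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy synthetic data; hold out 25% of samples for testing.
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=1)

model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# The training RMSE is typically far lower than the holdout RMSE.
print(train_rmse, test_rmse)
```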

Cross validation

To obtain a more accurate estimate of prediction performance and validate that we are not over-fitting, we need to check the cross-validation score rather than the fitting score.

In cross-validation, the data is partitioned randomly into \(n\) "splits" (in this case 10), each containing roughly the same number of samples. The model is trained on \(n-1\) splits (the training set) and the model performance evaluated by comparing the actual and predicted values for the final split (the testing set). In total, this process is repeated \(n\) times, such that each split is at some point used as the testing set. The cross-validation score is the average score across all testing sets.
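The partitioning described above can be made concrete with a toy example: 10 samples and 5 splits, so each testing set holds 2 samples and every sample is used for testing exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples

# Each iteration yields the indices of one train/test partition.
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    print("train:", train_idx, "test:", test_idx)
```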

There are a number of ways to partition the data into splits. In this example, we use the KFold method and select the number of splits to be 10. I.e., 90 % of the data will be used as the training set, with 10 % used as the testing set.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=1)

Note, we set random_state=1 to ensure every attendee gets the same answer for their model.

Finally, obtaining the cross-validation score can be automated using the scikit-learn cross_val_score() function. This function requires a machine learning model, the input features, and the target property as arguments. Note, we pass the kfold object as the cv argument so that cross_val_score() uses the correct test/train splits.

For each split, the model is trained from scratch before its performance is evaluated. As we have to train and predict 10 times, cross-validation can often take some time to perform. In our case, the model is quite small, so the process only takes about a minute. The final cross-validation score is the average across all splits.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)

rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
Mean RMSE: 18.676

Notice that our RMSE has almost tripled, as it now reflects the true predictive power of the model. However, a root-mean-squared error of ~18 GPa is still not bad!

Visualizing model performance

We can visualize the predictive performance of our model by plotting our predictions against the actual values, for each sample in the testing set across all test/train splits. First, we get the predicted values of the testing set for each split using the cross_val_predict() method. This is similar to cross_val_score(), except it returns the actual predictions rather than the model score.

from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(rf, X, y, cv=kfold)

For plotting, we use the PlotlyFig module of matminer, which helps you quickly produce publication-ready diagrams. PlotlyFig can produce many different types of plots. Explaining its use in detail is outside the scope of this tutorial, but examples of the available plots are provided in the FigRecipes section of the matminer_examples repository.

from matminer.figrecipes.plot import PlotlyFig

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               mode='notebook')

pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])], 
      labels=df['formula'], 
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}], 
      showlegends=False)

Not too bad! However, there are definitely some outliers (you can hover over the points with your mouse to see what they are).

Model interpretation

An important aspect of machine learning is being able to understand why a model is making certain predictions. Random forest models are particularly amenable to interpretation as they possess a feature_importances_ attribute, which contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.

rf.feature_importances_
array([2.77737706e-04, 6.86497802e-04, 4.19014378e-04, 9.38306579e-04,
       6.27788172e-04, 8.11685429e-04, 5.85205797e-03, 4.13582985e-04,
       5.68896006e-03, 3.99784395e-03, 2.29068933e-03, 4.00079790e-03,
       2.80170565e-04, 1.12108238e-03, 5.04161260e-04, 7.21240737e-04,
       6.66258777e-04, 2.65253041e-04, 6.31863354e-02, 2.61334748e-02,
       1.69886830e-03, 5.43940586e-01, 3.79683746e-03, 1.84024489e-03,
       2.00724094e-02, 2.74160018e-04, 2.89023628e-04, 1.65750614e-03,
       1.70010289e-03, 8.44812934e-03, 4.07321100e-05, 4.20522484e-05,
       8.69871167e-05, 1.12538000e-03, 1.05324451e-03, 4.16679765e-05,
       3.96170134e-04, 1.48598578e-03, 6.16768132e-04, 3.24518244e-03,
       8.40211726e-04, 6.08207520e-04, 2.48464536e-03, 5.38387869e-04,
       9.44676379e-04, 1.17935220e-02, 1.54290424e-03, 3.42464373e-03,
       1.62873316e-04, 4.59487184e-05, 1.77379083e-04, 1.35098601e-03,
       3.88133058e-04, 3.33992917e-04, 2.17490708e-04, 3.12382175e-05,
       1.01795637e-04, 1.03213685e-03, 6.85435551e-04, 6.34344688e-05,
       2.02095869e-04, 2.66249318e-03, 3.74273103e-04, 7.21229249e-04,
       1.29571361e-03, 3.56619677e-04, 3.67797732e-06, 1.81902930e-05,
       1.66200350e-05, 2.20302332e-04, 1.52030919e-04, 1.48360193e-05,
       1.71809628e-03, 1.19092123e-03, 3.73628049e-04, 1.42148596e-03,
       6.57524332e-04, 6.28622233e-03, 1.16750735e-05, 4.36732403e-05,
       3.66858164e-05, 1.34265640e-03, 1.90009692e-04, 1.65555892e-04,
       5.84467323e-05, 2.62242035e-04, 3.16564734e-04, 4.75664484e-04,
       9.69183532e-04, 2.32465824e-05, 3.02435786e-04, 3.16506500e-04,
       1.37421635e-03, 1.55517264e-03, 4.35381175e-03, 1.13587974e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.49055670e-03, 1.23382329e-03,
       1.39319470e-03, 5.25514449e-03, 1.45239205e-03, 2.80369169e-03,
       6.08487913e-04, 1.41780570e-02, 1.86595558e-03, 2.05837632e-02,
       2.18064410e-03, 3.69109295e-03, 8.52970476e-05, 4.96849240e-05,
       1.33715940e-04, 9.60365387e-04, 5.01421614e-04, 2.88261104e-04,
       1.88675594e-05, 9.08252204e-05, 1.04200208e-04, 4.82154177e-04,
       2.80548603e-04, 6.50107456e-05, 2.28368912e-04, 5.33462112e-04,
       3.00309809e-04, 1.14474431e-03, 1.02300569e-03, 3.41368312e-04,
       2.72149182e-02, 1.34503923e-01, 6.79013491e-03])

To make sense of this, we need to know which feature each number corresponds to. We can use PlotlyFig to plot the importances of the 5 most important features.

importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]

pf = PlotlyFig(y_title='Importance (%)',
               title='Feature by importances',
               mode='notebook')

pf.bar(x=included[indices][0:5], y=importances[indices][0:5])

Part 4: Automated machine learning using automatminer

Automatminer is a package for automatically creating ML pipelines using matminer's featurizers, feature reduction techniques, and Automated Machine Learning (AutoML). Automatminer works end to end - raw data to prediction - without any human input necessary.

automatminer logo

Put in a dataset, get out a machine that predicts materials properties.

Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics. Automatminer also includes utilities for running MatBench, a materials science ML benchmark.

Learn more about Automatminer and MatBench from the official documentation.

How does automatminer work?

Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer's descriptor library, picks the most useful features for learning, and runs an AutoML pipeline on the result. Once a pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials.

automatminer overview

Materials primitives (e.g., crystal structures) go in one end, and property predictions come out the other. MatPipe handles the intermediate operations such as assigning descriptors, cleaning problematic data, data conversions, imputation, and machine learning.

MatPipe is the main Automatminer object

MatPipe is the central object in Automatminer. It has a sklearn BaseEstimator syntax for fit and predict operations. Simply fit on your training data, then predict on your testing data.

MatPipe uses pandas dataframes as inputs and outputs.

Put dataframes (of materials) in, get dataframes (of property predictions) out.

Overview

In this section, we walk through the basic steps of using Automatminer to train and predict on data. We'll also view the internals of our AutoML pipeline using Automatminer's API.

  • First, we'll load a dataset of ~4,600 dielectric constants from the Materials Project.
  • Next, we'll fit an Automatminer MatPipe (pipeline) to the data.
  • Then, we'll predict dielectric constants from the structure, and see how our predictions do (note, this is not an easy problem!)
  • We'll examine our pipeline with MatPipe's introspection methods.
  • Finally, we look at how to save and load pipelines for reproducible predictions.

Note: for the sake of brevity, we will use a single train-test split in this notebook. To run a full Automatminer benchmark, see the documentation for MatPipe.benchmark

Preparing a dataset for machine learning

Let's load a dataset to play around with. For this example, we will use matminer to load one of the MatBench v0.1 datasets.

df = load_dataset("matbench_dielectric")

By inspecting the dataset we can see that only the "structure" and "n" (dielectric constant) columns are present.

df.head()
structure n
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232

Next, we generate a train-test split for evaluating Automatminer.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)

Let's remove the testing dataframe's target property so we can be sure we are not giving Automatminer any test information.

Our target variable is "n".

target = "n"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()
structure
1802 [[3.71205866 2.14315394 1.14375057] Si, [-3.71...
1881 [[0. 0. 0.] Cd, [1.35314892 0.95682078 2.34372...
1288 [[-0.50714072 4.9893142 6.08288682] K, [-1....
4490 [[3.90704797 2.76270011 6.76720559] Si, [0.558...
32 [[1.91506173 1.23473956 4.58373805] P, [ 5.553...

Fitting and predicting with Automatminer's MatPipe

Now we have everything we need to start our AutoML pipeline. For simplicity, we will use a MatPipe preset. MatPipe is highly customizable and has hundreds of configuration options, but most use cases will be satisfied by using one of the preset configurations. We use the from_preset method.

In this example, in the interest of time, we'll use the "debug" preset, which will spend approximately 1.5 minutes doing machine learning. The "express" preset is a good choice if you have more time available.

from automatminer import MatPipe

pipe = MatPipe.from_preset("debug")

Fitting the pipeline

To fit an Automatminer MatPipe to the data, pass in your training data and desired target.

pipe.fit(train_df, target)
2020-07-27 14:28:24 INFO     Problem type is: regression
2020-07-27 14:28:24 INFO     Fitting MatPipe pipeline to data.
2020-07-27 14:28:24 INFO     AutoFeaturizer: Starting fitting.
2020-07-27 14:28:24 INFO     AutoFeaturizer: Adding compositions from structures.
2020-07-27 14:28:24 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.




2020-07-27 14:30:22 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.


2020-07-27 14:30:50 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.


2020-07-27 14:30:55 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe to be fitted. Skipping...
2020-07-27 14:30:55 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe to be fitted. Skipping...
2020-07-27 14:30:55 INFO     AutoFeaturizer: Finished fitting.
2020-07-27 14:30:55 INFO     AutoFeaturizer: Starting transforming.
2020-07-27 14:30:55 INFO     AutoFeaturizer: Featurizing with ElementProperty.


2020-07-27 14:31:04 INFO     AutoFeaturizer: Featurizing with SineCoulombMatrix.


2020-07-27 14:31:22 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2020-07-27 14:31:22 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2020-07-27 14:31:22 INFO     AutoFeaturizer: Finished transforming.
2020-07-27 14:31:22 INFO     DataCleaner: Starting fitting.
2020-07-27 14:31:22 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'drop'
2020-07-27 14:31:22 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2020-07-27 14:31:22 INFO     DataCleaner: Before handling na: 3811 samples, 421 features
2020-07-27 14:31:22 INFO     DataCleaner: 0 samples did not have target values. They were dropped.
2020-07-27 14:31:22 INFO     DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2020-07-27 14:31:22 INFO     DataCleaner: After handling na: 3811 samples, 421 features
2020-07-27 14:31:22 INFO     DataCleaner: Finished fitting.
2020-07-27 14:31:22 INFO     FeatureReducer: Starting fitting.
2020-07-27 14:31:24 INFO     FeatureReducer: 284 features removed due to cross correlation more than 0.95
2020-07-27 14:35:22 INFO     TreeFeatureReducer: Finished tree-based feature reduction of 136 initial features to 11
2020-07-27 14:35:22 INFO     FeatureReducer: Finished fitting.
2020-07-27 14:35:22 INFO     FeatureReducer: Starting transforming.
2020-07-27 14:35:22 INFO     FeatureReducer: Finished transforming.
2020-07-27 14:35:22 INFO     TPOTAdaptor: Starting fitting.
27 operators have been imported by TPOT.

Skipped pipeline #3 due to time out. Continuing to the next pipeline.

1.04 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.


TPOT closed prematurely. Will use the current best pipeline.
2020-07-27 14:36:25 INFO     TPOTAdaptor: Finished fitting.
2020-07-27 14:36:25 INFO     MatPipe successfully fit.

/Users/alex/miniconda3/envs/py3/lib/python3.8/site-packages/sklearn/base.py:209: FutureWarning:

From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.


MatPipe(autofeaturizer=AutoFeaturizer(bandstructure_col=None, exclude=[],
                                      featurizers={'bandstructure': [BandFeaturizer()],
                                                   'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                                                                                   features=['Number',
                                                                                             'MendeleevNumber',
                                                                                             'AtomicWeight',
                                                                                             'MeltingT',
                                                                                             'Column',
                                                                                             'Row',
                                                                                             'CovalentRadius',
                                                                                             'Electronegativity',
                                                                                             'NsValence',
                                                                                             'NpVa...
                                                                                             'NpUnfilled',
                                                                                             'NdUnfilled',
                                                                                             'NfUnfilled',
                                                                                             'NUnfilled',
                                                                                             'GSvolume_pa',
                                                                                             'GSbandgap',
                                                                                             'GSmagmom',
                                                                                             'SpaceGroupNumber'],
                                                                                   stats=['minimum',
                                                                                          'maximum',
                                                                                          'range',
                                                                                          'mean',
                                                                                          'avg_dev',
                                                                                          'mode'])],
                                                   'dos': [DOSFeaturizer()],
                                                   'structure': [SineCoulombMatrix()]},
                                      ignore_cols=[], n_jobs=2,
                                      preset='debug'),
        cleaner=DataCleaner(), learner=TPOTAdaptor(),
        reducer=FeatureReducer(reducers=('corr', 'tree')))

Predicting new data

Our MatPipe is now fit. Let's predict our test data with MatPipe.predict. This should only take a few minutes.

prediction_df = pipe.predict(prediction_df)
2020-07-27 14:36:25 INFO     Beginning MatPipe prediction using fitted pipeline.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Starting transforming.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Adding compositions from structures.
2020-07-27 14:36:25 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.


2020-07-27 14:37:09 INFO     AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.


2020-07-27 14:37:15 INFO     AutoFeaturizer: Featurizing with ElementProperty.


2020-07-27 14:37:19 INFO     AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.


2020-07-27 14:37:22 INFO     AutoFeaturizer: Featurizing with SineCoulombMatrix.


2020-07-27 14:37:28 INFO     AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO     AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO     AutoFeaturizer: Finished transforming.
2020-07-27 14:37:28 INFO     DataCleaner: Starting transforming.
2020-07-27 14:37:28 INFO     DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2020-07-27 14:37:28 INFO     DataCleaner: Replacing infinite values with nan for easier screening.
2020-07-27 14:37:28 INFO     DataCleaner: Before handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO     DataCleaner: After handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO     DataCleaner: Target not found in df columns. Ignoring...
2020-07-27 14:37:28 INFO     DataCleaner: Finished transforming.
2020-07-27 14:37:28 INFO     FeatureReducer: Starting transforming.
2020-07-27 14:37:28 WARNING  FeatureReducer: Target not found in columns to transform.
2020-07-27 14:37:28 INFO     FeatureReducer: Finished transforming.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Starting predicting.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Prediction finished successfully.
2020-07-27 14:37:28 INFO     TPOTAdaptor: Finished predicting.
2020-07-27 14:37:28 INFO     MatPipe prediction completed.

Examine predictions

MatPipe places the predictions in a column called "{target} predicted":

prediction_df.head()
| | MagpieData range AtomicWeight | MagpieData avg_dev AtomicWeight | MagpieData mean MeltingT | MagpieData maximum Electronegativity | MagpieData mean Electronegativity | MagpieData avg_dev Electronegativity | MagpieData avg_dev NUnfilled | MagpieData mean GSvolume_pa | sine coulomb matrix eig 0 | sine coulomb matrix eig 6 | sine coulomb matrix eig 7 | n predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1802 | 102.710600 | 48.656496 | 343.725333 | 3.44 | 2.745333 | 0.740978 | 0.995556 | 19.532667 | 7730.368815 | 5782.647332 | 5818.707983 | 2.056969 |
| 1881 | 15.189000 | 7.594500 | 658.440000 | 2.10 | 1.895000 | 0.205000 | 1.000000 | 27.129167 | 5374.973644 | 0.000000 | 0.000000 | 3.551161 |
| 1288 | 49.380600 | 11.876093 | 460.776364 | 3.44 | 2.520909 | 1.002645 | 0.727273 | 23.243939 | 1822.152522 | 324.153897 | 278.484976 | 1.882713 |
| 4490 | 0.000000 | 0.000000 | 1687.000000 | 1.90 | 1.900000 | 0.000000 | 0.000000 | 20.440000 | 303.528424 | 0.000000 | 0.000000 | 6.726282 |
| 32 | 32.572238 | 12.983210 | 714.687500 | 2.19 | 1.567500 | 0.560625 | 0.750000 | 37.208810 | 1691.751065 | 1584.601981 | 1584.601981 | 3.338190 |
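Because MatPipe keeps the original DataFrame index (note the 1802, 1881, ... row labels above), predictions can be lined up with held-out true values via an index join. A minimal pandas sketch, using made-up target values for two of the rows above:

```python
import pandas as pd

# Hypothetical stand-ins for the true targets ("n") and the MatPipe output;
# the "n" values here are invented for illustration only.
true_df = pd.DataFrame({"n": [2.10, 3.50]}, index=[1802, 1881])
pred_df = pd.DataFrame({"n predicted": [3.551161, 2.056969]}, index=[1881, 1802])

# join() aligns on the index, so row order does not matter
compared = true_df.join(pred_df)
compared["abs error"] = (compared["n predicted"] - compared["n"]).abs()
print(compared)
```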

Score predictions

Now let's score our predictions using the mean absolute error (MAE), and compare them against a DummyRegressor baseline from scikit-learn.

from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor

# Fit the dummy regressor (it ignores its inputs and always predicts the training-set mean)
dr = DummyRegressor()
dr.fit(train_df["structure"], train_df[target])
dummy_test = dr.predict(test_df["structure"])


# Score dummy and MatPipe
true = test_df[target]
matpipe_test = prediction_df[target + " predicted"]

mae_matpipe = mean_absolute_error(true, matpipe_test)
mae_dummy = mean_absolute_error(true, dummy_test)

print("Dummy MAE: {}".format(mae_dummy))
print("MatPipe MAE: {}".format(mae_matpipe))
Dummy MAE: 0.7772666142371938
MatPipe MAE: 0.5030822760911582
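MAE is only one view of the error; on the same `true`/`matpipe_test` arrays you could also report the root mean squared error and R². A short sketch on illustrative stand-in arrays (the values below are invented, not from this dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative stand-ins for `true` and `matpipe_test`
y_true = np.array([1.5, 2.0, 3.2, 4.1])
y_pred = np.array([1.7, 1.9, 3.0, 4.4])

rmse = mean_squared_error(y_true, y_pred) ** 0.5  # penalizes large errors more than MAE
r2 = r2_score(y_true, y_pred)                     # 1.0 is perfect; 0.0 matches the mean baseline
print(rmse, r2)
```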

Examining the internals of MatPipe

Inspect the internals of MatPipe with a dict/text digest from either MatPipe.inspect (a long, comprehensive listing of all pipeline attribute names) or MatPipe.summarize (an executive summary).

import pprint

# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_predict_experimental_gap_from_composition_summary.json")

pprint.pprint(summary)
{'data_cleaning': {'drop_na_targets': 'True',
                   'encoder': 'one-hot',
                   'feature_na_method': 'drop',
                   'na_method_fit': 'drop',
                   'na_method_transform': 'fill'},
 'feature_reduction': {'reducer_params': "{'tree': {'importance_percentile': "
                                         "0.9, 'mode': 'regression', "
                                         "'random_state': 0}}",
                       'reducers': "('corr', 'tree')"},
 'features': ['MagpieData range AtomicWeight',
              'MagpieData avg_dev AtomicWeight',
              'MagpieData mean MeltingT',
              'MagpieData maximum Electronegativity',
              'MagpieData mean Electronegativity',
              'MagpieData avg_dev Electronegativity',
              'MagpieData avg_dev NUnfilled',
              'MagpieData mean GSvolume_pa',
              'sine coulomb matrix eig 0',
              'sine coulomb matrix eig 6',
              'sine coulomb matrix eig 7'],
 'featurizers': {'bandstructure': [BandFeaturizer()],
                 'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                features=['Number', 'MendeleevNumber', 'AtomicWeight',
                          'MeltingT', 'Column', 'Row', 'CovalentRadius',
                          'Electronegativity', 'NsValence', 'NpValence',
                          'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                          'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                          'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                          'SpaceGroupNumber'],
                stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                       'mode'])],
                 'dos': [DOSFeaturizer()],
                 'structure': [SineCoulombMatrix()]},
 'ml_model': 'Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),\n'
             "         steps=[('selectpercentile',\n"
             '                 SelectPercentile(percentile=23,\n'
             '                                  score_func=<function '
             'f_regression at 0x7f92217f2040>)),\n'
             "                ('robustscaler', RobustScaler()),\n"
             "                ('randomforestregressor',\n"
             '                 RandomForestRegressor(bootstrap=False, '
             'max_features=0.05,\n'
             '                                       min_samples_leaf=7, '
             'min_samples_split=5,\n'
             '                                       n_estimators=20))])'}
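The 'ml_model' entry shows the pipeline TPOT settled on: percentile-based feature selection, robust scaling, then a random forest. For orientation, here is a hand-built scikit-learn pipeline with the same steps; the hyperparameters are copied from the summary above, while the data are synthetic placeholders rather than our featurized DataFrame:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor

# The same three steps TPOT chose, with its hyperparameters
pipe = make_pipeline(
    SelectPercentile(percentile=23, score_func=f_regression),
    RobustScaler(),
    RandomForestRegressor(bootstrap=False, max_features=0.05,
                          min_samples_leaf=7, min_samples_split=5,
                          n_estimators=20, random_state=0),
)

# Synthetic regression data standing in for the reduced feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] + rng.normal(scale=0.1, size=50)

pipe.fit(X, y)
preds = pipe.predict(X)
```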

# Examine the MatPipe's internals more comprehensively

details = pipe.inspect(filename="MatPipe_predict_experimental_gap_from_composition_details.json")

print(details)
{'autofeaturizer': {'autofeaturizer': {'cache_src': None, 'preset': 'debug', 'featurizers': {'composition': [ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                features=['Number', 'MendeleevNumber', 'AtomicWeight',
                          'MeltingT', 'Column', 'Row', 'CovalentRadius',
                          'Electronegativity', 'NsValence', 'NpValence',
                          'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                          'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                          'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                          'SpaceGroupNumber'],
                stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                       'mode'])], 'structure': [SineCoulombMatrix()], 'bandstructure': [BandFeaturizer()], 'dos': [DOSFeaturizer()]}, 'exclude': [], 'functionalize': False, 'ignore_cols': [], 'fitted_input_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 2, 'samples': 3811}, 'converted_input_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 3, 'samples': 3811}, 'ignore_errors': True, 'drop_inputs': True, 'multiindex': False, 'do_precheck': True, 'n_jobs': 2, 'guess_oxistates': True, 'features': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData range Number', 'MagpieData mean Number', 'MagpieData avg_dev Number', 'MagpieData mode Number', 'MagpieData minimum MendeleevNumber', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum AtomicWeight', 'MagpieData maximum AtomicWeight', 'MagpieData range AtomicWeight', 'MagpieData mean AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mode AtomicWeight', 'MagpieData minimum MeltingT', 'MagpieData maximum MeltingT', 'MagpieData range MeltingT', 'MagpieData mean MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData minimum Column', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 'MagpieData maximum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev 
Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 'MagpieData maximum NsValence', 'MagpieData range NsValence', 'MagpieData mean NsValence', 'MagpieData avg_dev NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData range NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData minimum NsUnfilled', 'MagpieData maximum NsUnfilled', 'MagpieData range NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData range NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData avg_dev NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData avg_dev NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData maximum GSvolume_pa', 'MagpieData 
range GSvolume_pa', 'MagpieData mean GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData range GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData avg_dev GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData avg_dev GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', 'MagpieData maximum SpaceGroupNumber', 'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 1', 'sine coulomb matrix eig 2', 'sine coulomb matrix eig 3', 'sine coulomb matrix eig 4', 'sine coulomb matrix eig 5', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7', 'sine coulomb matrix eig 8', 'sine coulomb matrix eig 9', 'sine coulomb matrix eig 10', 'sine coulomb matrix eig 11', 'sine coulomb matrix eig 12', 'sine coulomb matrix eig 13', 'sine coulomb matrix eig 14', 'sine coulomb matrix eig 15', 'sine coulomb matrix eig 16', 'sine coulomb matrix eig 17', 'sine coulomb matrix eig 18', 'sine coulomb matrix eig 19', 'sine coulomb matrix eig 20', 'sine coulomb matrix eig 21', 'sine coulomb matrix eig 22', 'sine coulomb matrix eig 23', 'sine coulomb matrix eig 24', 'sine coulomb matrix eig 25', 'sine coulomb matrix eig 26', 'sine coulomb matrix eig 27', 'sine coulomb matrix eig 28', 'sine coulomb matrix eig 29', 'sine coulomb matrix eig 30', 'sine coulomb matrix eig 31', 'sine coulomb matrix eig 32', 'sine coulomb matrix eig 33', 'sine coulomb matrix eig 34', 'sine coulomb matrix eig 35', 'sine coulomb matrix eig 36', 'sine coulomb matrix eig 37', 'sine coulomb matrix eig 38', 'sine coulomb matrix eig 39', 'sine coulomb matrix eig 40', 'sine coulomb matrix eig 41', 'sine coulomb matrix eig 42', 'sine coulomb 
matrix eig 43', 'sine coulomb matrix eig 44', 'sine coulomb matrix eig 45', 'sine coulomb matrix eig 46', 'sine coulomb matrix eig 47', 'sine coulomb matrix eig 48', 'sine coulomb matrix eig 49', 'sine coulomb matrix eig 50', 'sine coulomb matrix eig 51', 'sine coulomb matrix eig 52', 'sine coulomb matrix eig 53', 'sine coulomb matrix eig 54', 'sine coulomb matrix eig 55', 'sine coulomb matrix eig 56', 'sine coulomb matrix eig 57', 'sine coulomb matrix eig 58', 'sine coulomb matrix eig 59', 'sine coulomb matrix eig 60', 'sine coulomb matrix eig 61', 'sine coulomb matrix eig 62', 'sine coulomb matrix eig 63', 'sine coulomb matrix eig 64', 'sine coulomb matrix eig 65', 'sine coulomb matrix eig 66', 'sine coulomb matrix eig 67', 'sine coulomb matrix eig 68', 'sine coulomb matrix eig 69', 'sine coulomb matrix eig 70', 'sine coulomb matrix eig 71', 'sine coulomb matrix eig 72', 'sine coulomb matrix eig 73', 'sine coulomb matrix eig 74', 'sine coulomb matrix eig 75', 'sine coulomb matrix eig 76', 'sine coulomb matrix eig 77', 'sine coulomb matrix eig 78', 'sine coulomb matrix eig 79', 'sine coulomb matrix eig 80', 'sine coulomb matrix eig 81', 'sine coulomb matrix eig 82', 'sine coulomb matrix eig 83', 'sine coulomb matrix eig 84', 'sine coulomb matrix eig 85', 'sine coulomb matrix eig 86', 'sine coulomb matrix eig 87', 'sine coulomb matrix eig 88', 'sine coulomb matrix eig 89', 'sine coulomb matrix eig 90', 'sine coulomb matrix eig 91', 'sine coulomb matrix eig 92', 'sine coulomb matrix eig 93', 'sine coulomb matrix eig 94', 'sine coulomb matrix eig 95', 'sine coulomb matrix eig 96', 'sine coulomb matrix eig 97', 'sine coulomb matrix eig 98', 'sine coulomb matrix eig 99', 'sine coulomb matrix eig 100', 'sine coulomb matrix eig 101', 'sine coulomb matrix eig 102', 'sine coulomb matrix eig 103', 'sine coulomb matrix eig 104', 'sine coulomb matrix eig 105', 'sine coulomb matrix eig 106', 'sine coulomb matrix eig 107', 'sine coulomb matrix eig 108', 'sine coulomb matrix eig 
109', 'sine coulomb matrix eig 110', 'sine coulomb matrix eig 111', 'sine coulomb matrix eig 112', 'sine coulomb matrix eig 113', 'sine coulomb matrix eig 114', 'sine coulomb matrix eig 115', 'sine coulomb matrix eig 116', 'sine coulomb matrix eig 117', 'sine coulomb matrix eig 118', 'sine coulomb matrix eig 119', 'sine coulomb matrix eig 120', 'sine coulomb matrix eig 121', 'sine coulomb matrix eig 122', 'sine coulomb matrix eig 123', 'sine coulomb matrix eig 124', 'sine coulomb matrix eig 125', 'sine coulomb matrix eig 126', 'sine coulomb matrix eig 127', 'sine coulomb matrix eig 128', 'sine coulomb matrix eig 129', 'sine coulomb matrix eig 130', 'sine coulomb matrix eig 131', 'sine coulomb matrix eig 132', 'sine coulomb matrix eig 133', 'sine coulomb matrix eig 134', 'sine coulomb matrix eig 135', 'sine coulomb matrix eig 136', 'sine coulomb matrix eig 137', 'sine coulomb matrix eig 138', 'sine coulomb matrix eig 139', 'sine coulomb matrix eig 140', 'sine coulomb matrix eig 141', 'sine coulomb matrix eig 142', 'sine coulomb matrix eig 143', 'sine coulomb matrix eig 144', 'sine coulomb matrix eig 145', 'sine coulomb matrix eig 146', 'sine coulomb matrix eig 147', 'sine coulomb matrix eig 148', 'sine coulomb matrix eig 149', 'sine coulomb matrix eig 150', 'sine coulomb matrix eig 151', 'sine coulomb matrix eig 152', 'sine coulomb matrix eig 153', 'sine coulomb matrix eig 154', 'sine coulomb matrix eig 155', 'sine coulomb matrix eig 156', 'sine coulomb matrix eig 157', 'sine coulomb matrix eig 158', 'sine coulomb matrix eig 159', 'sine coulomb matrix eig 160', 'sine coulomb matrix eig 161', 'sine coulomb matrix eig 162', 'sine coulomb matrix eig 163', 'sine coulomb matrix eig 164', 'sine coulomb matrix eig 165', 'sine coulomb matrix eig 166', 'sine coulomb matrix eig 167', 'sine coulomb matrix eig 168', 'sine coulomb matrix eig 169', 'sine coulomb matrix eig 170', 'sine coulomb matrix eig 171', 'sine coulomb matrix eig 172', 'sine coulomb matrix eig 173', 'sine 
coulomb matrix eig 174', 'sine coulomb matrix eig 175', 'sine coulomb matrix eig 176', 'sine coulomb matrix eig 177', 'sine coulomb matrix eig 178', 'sine coulomb matrix eig 179', 'sine coulomb matrix eig 180', 'sine coulomb matrix eig 181', 'sine coulomb matrix eig 182', 'sine coulomb matrix eig 183', 'sine coulomb matrix eig 184', 'sine coulomb matrix eig 185', 'sine coulomb matrix eig 186', 'sine coulomb matrix eig 187', 'sine coulomb matrix eig 188', 'sine coulomb matrix eig 189', 'sine coulomb matrix eig 190', 'sine coulomb matrix eig 191', 'sine coulomb matrix eig 192', 'sine coulomb matrix eig 193', 'sine coulomb matrix eig 194', 'sine coulomb matrix eig 195', 'sine coulomb matrix eig 196', 'sine coulomb matrix eig 197', 'sine coulomb matrix eig 198', 'sine coulomb matrix eig 199', 'sine coulomb matrix eig 200', 'sine coulomb matrix eig 201', 'sine coulomb matrix eig 202', 'sine coulomb matrix eig 203', 'sine coulomb matrix eig 204', 'sine coulomb matrix eig 205', 'sine coulomb matrix eig 206', 'sine coulomb matrix eig 207', 'sine coulomb matrix eig 208', 'sine coulomb matrix eig 209', 'sine coulomb matrix eig 210', 'sine coulomb matrix eig 211', 'sine coulomb matrix eig 212', 'sine coulomb matrix eig 213', 'sine coulomb matrix eig 214', 'sine coulomb matrix eig 215', 'sine coulomb matrix eig 216', 'sine coulomb matrix eig 217', 'sine coulomb matrix eig 218', 'sine coulomb matrix eig 219', 'sine coulomb matrix eig 220', 'sine coulomb matrix eig 221', 'sine coulomb matrix eig 222', 'sine coulomb matrix eig 223', 'sine coulomb matrix eig 224', 'sine coulomb matrix eig 225', 'sine coulomb matrix eig 226', 'sine coulomb matrix eig 227', 'sine coulomb matrix eig 228', 'sine coulomb matrix eig 229', 'sine coulomb matrix eig 230', 'sine coulomb matrix eig 231', 'sine coulomb matrix eig 232', 'sine coulomb matrix eig 233', 'sine coulomb matrix eig 234', 'sine coulomb matrix eig 235', 'sine coulomb matrix eig 236', 'sine coulomb matrix eig 237', 'sine coulomb matrix 
eig 238', 'sine coulomb matrix eig 239', 'sine coulomb matrix eig 240', 'sine coulomb matrix eig 241', 'sine coulomb matrix eig 242', 'sine coulomb matrix eig 243', 'sine coulomb matrix eig 244', 'sine coulomb matrix eig 245', 'sine coulomb matrix eig 246', 'sine coulomb matrix eig 247', 'sine coulomb matrix eig 248', 'sine coulomb matrix eig 249', 'sine coulomb matrix eig 250', 'sine coulomb matrix eig 251', 'sine coulomb matrix eig 252', 'sine coulomb matrix eig 253', 'sine coulomb matrix eig 254', 'sine coulomb matrix eig 255', 'sine coulomb matrix eig 256', 'sine coulomb matrix eig 257', 'sine coulomb matrix eig 258', 'sine coulomb matrix eig 259', 'sine coulomb matrix eig 260', 'sine coulomb matrix eig 261', 'sine coulomb matrix eig 262', 'sine coulomb matrix eig 263', 'sine coulomb matrix eig 264', 'sine coulomb matrix eig 265', 'sine coulomb matrix eig 266', 'sine coulomb matrix eig 267', 'sine coulomb matrix eig 268', 'sine coulomb matrix eig 269', 'sine coulomb matrix eig 270', 'sine coulomb matrix eig 271', 'sine coulomb matrix eig 272', 'sine coulomb matrix eig 273', 'sine coulomb matrix eig 274', 'sine coulomb matrix eig 275', 'sine coulomb matrix eig 276', 'sine coulomb matrix eig 277', 'sine coulomb matrix eig 278', 'sine coulomb matrix eig 279', 'sine coulomb matrix eig 280', 'sine coulomb matrix eig 281', 'sine coulomb matrix eig 282', 'sine coulomb matrix eig 283', 'sine coulomb matrix eig 284', 'sine coulomb matrix eig 285', 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'auto_featurizer': True, 'removed_featurizers': [], 'composition_col': 'composition', 'structure_col': 'structure', 'bandstruct_col': 'bandstructure', 'dos_col': 'dos', 'is_fit': True, 'fittable_fcls': {'BagofBonds', 'PartialRadialDistributionFunction', 'BondFractions'}, 'needs_fit': False, 'min_precheck_frac': 0.9}}, 'cleaner': {'cleaner': {'max_na_frac': 0.01, 'feature_na_method': 'drop', 'encoder': 'one-hot', 'encode_categories': True, 'drop_na_targets': True, 
'na_method_fit': 'drop', 'na_method_transform': 'fill', 'dropped_features': [], 'object_cols': [], 'number_cols': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData range Number', 'MagpieData mean Number', 'MagpieData avg_dev Number', 'MagpieData mode Number', 'MagpieData minimum MendeleevNumber', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum AtomicWeight', 'MagpieData maximum AtomicWeight', 'MagpieData range AtomicWeight', 'MagpieData mean AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mode AtomicWeight', 'MagpieData minimum MeltingT', 'MagpieData maximum MeltingT', 'MagpieData range MeltingT', 'MagpieData mean MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData minimum Column', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 'MagpieData maximum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 'MagpieData maximum NsValence', 'MagpieData range NsValence', 'MagpieData mean NsValence', 'MagpieData avg_dev NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode 
NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData range NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData minimum NsUnfilled', 'MagpieData maximum NsUnfilled', 'MagpieData range NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData range NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData avg_dev NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData avg_dev NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData maximum GSvolume_pa', 'MagpieData range GSvolume_pa', 'MagpieData mean GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData range GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData avg_dev GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData mean 
GSmagmom', 'MagpieData avg_dev GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', ..., 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'fitted_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 421, 'samples': 3811}, 'fitted_target': 'n', 'dropped_samples': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 421, 'samples': 0}, 'max_problem_col_warning_threshold': 0.3, 'warnings': [], 'is_fit': True}}, 'reducer': {'reducer': {'reducers': ('corr', 'tree'), 'corr_threshold': 0.95, 'n_pca_features': 'auto', 'tree_importance_percentile': 0.9, 'n_rebate_features': 0.3, '_keep_features': [], '_remove_features': [], 'removed_features': {'corr': ['MagpieData range Number', 'MagpieData mean Number', ..., 'sine coulomb matrix eig 287'], 'tree': ['MagpieData minimum Number', 'MagpieData maximum Number', ..., 'sine coulomb matrix eig 151']}, 'retained_features': ['sine coulomb matrix eig 6', 'MagpieData range AtomicWeight', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'MagpieData mean MeltingT', 'MagpieData mean Electronegativity', 'MagpieData avg_dev AtomicWeight', 'MagpieData maximum Electronegativity', 'sine coulomb matrix eig 7', 'MagpieData avg_dev Electronegativity'], 'reducer_params': {'tree': {'importance_percentile': 0.9, 'mode': 'regression', 'random_state': 0}}, '_pca': None, '_pca_feats': None, 'is_fit': True}}, 'learner': {'learner': {'mode': 'regression', 'tpot_kwargs': {'max_time_mins': 1, 'max_eval_time_mins': 1, 'population_size': 10, 'n_jobs': 2, 'cv': 5, 'verbosity': 3, 'memory': 'auto', 'template': 'Selector-Transformer-Regressor', 'config_dict': {'sklearn.linear_model.ElasticNetCV': {...}, 'sklearn.ensemble.ExtraTreesRegressor': {...}, 'sklearn.ensemble.GradientBoostingRegressor': {...}, 'sklearn.tree.DecisionTreeRegressor': {...}, 'sklearn.neighbors.KNeighborsRegressor': {...}, 'sklearn.linear_model.LassoLarsCV': {...}, 'sklearn.svm.LinearSVR': {...}, 'sklearn.ensemble.RandomForestRegressor': {...}, 'sklearn.linear_model.RidgeCV': {}, ...}, 'scoring': 'neg_mean_absolute_error'}, 'models': None, 'random_state': None, 'greater_score_is_better': None, '_fitted_target': 'n', '_backend': TPOTRegressor(...), '_features': ['MagpieData range AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mean MeltingT', 'MagpieData maximum Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7'], 'from_serialized': False, '_best_models': None, 'is_fit': True}}, 'pre_fit_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 2, 'samples': 3811}, 'post_fit_df': {'obj': <class 'pandas.core.frame.DataFrame'>, 'columns': 12, 'samples': 3811}, 'ml_type': 'regression', 'target': 'n', 'version': '1.0.3.20191111', 'is_fit': True}

(long feature lists and the full TPOT search-space configuration truncated for readability)

Access MatPipe's internal objects directly.

Instead of going through the text digest, you can access MatPipe's internal objects directly; you just need to know which attributes to access. See the online API documentation or the source code for more information.

# Access some attributes of MatPipe directly, instead of via a text digest

print(pipe.learner.best_pipeline)
Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),
         steps=[('selectpercentile',
                 SelectPercentile(percentile=23,
                                  score_func=<function f_regression at 0x7f92217f2040>)),
                ('robustscaler', RobustScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=False, max_features=0.05,
                                       min_samples_leaf=7, min_samples_split=5,
                                       n_estimators=20))])

print(pipe.autofeaturizer.featurizers["composition"])
[ElementProperty(data_source=<matminer.utils.data.MagpieData object at 0x7f92058afaf0>,
                features=['Number', 'MendeleevNumber', 'AtomicWeight',
                          'MeltingT', 'Column', 'Row', 'CovalentRadius',
                          'Electronegativity', 'NsValence', 'NpValence',
                          'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
                          'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
                          'GSvolume_pa', 'GSbandgap', 'GSmagmom',
                          'SpaceGroupNumber'],
                stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
                       'mode'])]

print(pipe.autofeaturizer.featurizers["structure"])
[SineCoulombMatrix()]

Persistence of pipelines

Being able to reproduce your results is a crucial aspect of materials informatics. MatPipe provides methods for easily saving and loading entire pipelines for use by others.

Save a MatPipe for later with MatPipe.save. Load it with MatPipe.load.

filename = "MatPipe_predict_experimental_gap_from_composition.p"

pipe.save(filename)
pipe_loaded = MatPipe.load(filename)
2020-07-27 14:37:33 INFO     Loaded MatPipe from file MatPipe_predict_experimental_gap_from_composition.p.
2020-07-27 14:37:33 WARNING  Only use this model to make predictions (do not retrain!). Backend was serialzed as only the top model, not the full automl backend.
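Under the hood, MatPipe.save is a pickle-style dump of the fitted pipeline object, which is why the loader warns you not to retrain: only the best model is serialized, not the full TPOT search backend. The general save-and-reload pattern can be sketched with Python's standard pickle module. Note that FittedModel below is a hypothetical stand-in for any fitted pipeline object, not part of the automatminer API:

```python
import os
import pickle
import tempfile

class FittedModel:
    """Hypothetical stand-in for a fitted pipeline object."""
    def __init__(self, target, features):
        self.target = target
        self.features = features

    def predict(self, x):
        # Dummy prediction: just report how many features the model uses
        return len(self.features)

model = FittedModel(target="n", features=["mean Electronegativity", "eig 0"])

# Save the fitted object to disk (the core of what MatPipe.save does,
# alongside extra versioning and backend-stripping logic)
path = os.path.join(tempfile.gettempdir(), "pipe_sketch.p")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back later, e.g. in another session
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.target)  # the loaded object keeps its fitted state
```

Because pickle restores the object's state but not its training history, the reloaded pipeline is suitable for prediction only, matching the warning in the log above.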